HTML Parsing

HTML parsing is the process by which the browser reads your raw HTML text (.html file) and converts it into a structured, in-memory representation called the DOM (Document Object Model).

1. HTML Source Code Arrives

When you open a webpage, the browser downloads the HTML file from the server.

<!DOCTYPE html>
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <h1>Hello</h1>
    <p>Welcome to HTML parsing!</p>
  </body>
</html>

At this point the browser has only plain text. Nothing is structured or rendered yet.

2. Tokenization

The browser’s HTML parser starts reading this text character by character.

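It breaks the text into tokens, each representing one HTML construct:

DOCTYPE declarations (<!DOCTYPE html>)
Start tags (<html>, <body>), which carry any attributes the tag has
End tags (</body>)
Text (Hello, Welcome...), which becomes text nodes once the tree is built
Comments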

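For the example above, the token stream looks like this (whitespace-only text is left out):

< !DOCTYPE html >
< html >
< head >
< title >
My Page
</ title >
</ head >
< body >
< h1 >
Hello
</ h1 >
...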
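
The tokenization step can be sketched in a few lines of Python. This is only an illustration, assuming the standard library's html.parser module as a stand-in for a browser tokenizer; it is not the algorithm browsers implement, and the names TokenCollector and tokenize are made up for this example.

```python
from html.parser import HTMLParser

SOURCE = """<!DOCTYPE html>
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <h1>Hello</h1>
    <p>Welcome to HTML parsing!</p>
  </body>
</html>
"""


class TokenCollector(HTMLParser):
    """Hypothetical helper: records each recognized construct as (kind, value)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>.
        self.tokens.append(("doctype", decl))

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; it is ignored in this sketch.
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags to keep the output short.
        text = data.strip()
        if text:
            self.tokens.append(("text", text))

    def handle_comment(self, data):
        self.tokens.append(("comment", data))


def tokenize(html):
    """Feed the whole document to the collector and return its token list."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


if __name__ == "__main__":
    for kind, value in tokenize(SOURCE):
        print(kind, value)
```

Running the script prints one token per line, in the order the parser meets them in the source.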

3. DOM Tree Construction

As the tokens are recognized, the browser creates nodes and connects them hierarchically to build the DOM tree.

Document
└── html
    ├── head
    │   └── title: "My Page"
    └── body
        ├── h1: "Hello"
        └── p: "Welcome to HTML parsing!"

The result is a tree structure representing every element and its parent-child relationships.
JavaScript reads and modifies this tree through the DOM API, and CSS selectors are matched against its elements.
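
Tree construction can be sketched the same way: keep a stack of currently open elements, attach each new node to the element on top of the stack, and pop when the matching end tag arrives. The code below is a minimal sketch under stated assumptions, not how a browser does it. Node, TreeBuilder, build_tree and dump are hypothetical names, Python's html.parser stands in for the tokenizer, and void elements such as <br> and the browser's error-recovery rules are ignored.

```python
from html.parser import HTMLParser


class Node:
    """A minimal stand-in for a DOM node: a name, optional text, and children."""

    def __init__(self, name, text=None):
        self.name = name
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree by tracking the stack of currently open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("Document")
        self.open_elements = [self.root]

    def handle_starttag(self, tag, attrs):
        # A new element becomes a child of the innermost open element.
        node = Node(tag)
        self.open_elements[-1].children.append(node)
        self.open_elements.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.open_elements) - 1, 0, -1):
            if self.open_elements[i].name == tag:
                del self.open_elements[i:]
                break

    def handle_data(self, data):
        # Non-blank text becomes a text node under the innermost open element.
        text = data.strip()
        if text:
            self.open_elements[-1].children.append(Node("#text", text))


def build_tree(html):
    """Parse the markup and return the root Document node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Return the tree as a list of indented lines, one node per line."""
    label = node.name if node.text is None else f'{node.name}: "{node.text}"'
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


if __name__ == "__main__":
    page = (
        "<html><head><title>My Page</title></head>"
        "<body><h1>Hello</h1><p>Welcome to HTML parsing!</p></body></html>"
    )
    print("\n".join(dump(build_tree(page))))
```

Unlike the simplified diagram above, this sketch stores text as separate #text child nodes, which is closer to how the DOM represents text content.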
